Exclude DataNodes being removed from new Region allocation by CRZbulabula · Pull Request #17934 · apache/iotdb

CRZbulabula · 2026-06-12T12:25:23Z

Problem

When a remove datanode is in progress, the ConfigNode could still allocate brand-new Region replicas onto the DataNode being removed.

This is especially likely when the target DataNode was killed (e.g. kill -9) before the removal: the failure detector reports such a node as Unknown rather than Removing, and RegionBalancer intentionally keeps Unknown DataNodes as allocation candidates (to cope with an insufficient number of online nodes). As a result, a new region (e.g. for the internal root.__system database) could be assigned a replica on the dead node. That replica can never be created there (Connection refused), yet the metadata keeps the assignment and retries forever, so the RemoveDataNodesProcedure gets stuck and the target DataNode never disappears from show datanodes.

Observed timeline (from the report):

11:31:06.210  Submit RemoveDataNodesProcedure successfully
11:31:06.550  create SchemaRegion 8 on DataNodes [3, 7, 6]   # DN7 is being removed
11:31:07.444  create DataRegion 9 on DataNodes [5, 7]        # DN7 is being removed
              Failed to CREATE_*_REGION on DataNode 7: Connection refused  (retried forever)

Root cause

RegionBalancer.genRegionGroupsAllocationPlan gathers allocation candidates with:

getNodeManager().filterDataNodeThroughStatus(NodeStatus.Running, NodeStatus.Unknown)

A node-status filter alone cannot fix this: a DataNode killed before removal is Unknown (not Removing), because DataNodeHeartbeatCache.updateCurrentStatistics lets the failure detector override the Removing status back to Unknown once heartbeats stop. So the removing node still passes the filter.

Fix

Consult the in-progress RemoveDataNodesProcedure — the authoritative, leader-switch-durable source of which DataNodes are being removed — and exclude those nodes from the allocation candidates.

Add ProcedureManager.getRemovingDataNodeIds(), which scans the unfinished RemoveDataNodesProcedure(s) and returns the removing DataNode ids. This mirrors the existing pattern in ProcedureManager.checkRemoveDataNodes / checkRegionOperationWithRemoveDataNode.
RegionBalancer.genRegionGroupsAllocationPlan now filters those ids out of the candidate list, in the same spirit as the existing RemoveDataNodeHandler.selectedRegionMigrationPlans.

This keeps the existing "allow Unknown candidates when online nodes are scarce" behavior intact, and only removes nodes that are actively being removed. The removal procedure already guarantees (checkEnoughDataNodeAfterRemoving) that enough non-removing nodes remain, so this never spuriously triggers NotEnoughDataNodeException.

Test

IoTDBRemoveDataNodeRegionAllocationIT kills a DataNode that hosts regions, submits the removal, and — while the removal is still in progress — forces a fresh Region allocation by creating a new database. It asserts that none of the newly allocated regions land on the DataNode being removed (comparing against a pre-allocation snapshot of region ids, since the removing node legitimately keeps hosting its own pre-existing regions until each finishes migrating away), and that the removal then completes.

🤖 Generated with Claude Code

When a `remove datanode` is in progress, the ConfigNode could still allocate brand-new Region replicas onto the DataNode being removed. This was especially likely when the target DataNode had been killed (e.g. `kill -9`) before removal: the failure detector reports such a node as `Unknown` rather than `Removing`, and `RegionBalancer` intentionally keeps `Unknown` DataNodes as allocation candidates (to cope with insufficient online nodes). The new replica could never be created on the dead node (Connection refused), yet the metadata kept the assignment and retried forever, so the removal hung and the target DataNode never disappeared from `show datanodes`. A node-status filter alone cannot fix this, because the killed node is `Unknown`, not `Removing`. Instead, `RegionBalancer` now consults the in-progress `RemoveDataNodesProcedure` (the authoritative, leader-switch durable source of which DataNodes are being removed) via the new `ProcedureManager.getRemovingDataNodeIds()` and drops those DataNodes from the allocation candidates. This mirrors the existing filtering in `RemoveDataNodeHandler.selectedRegionMigrationPlans`. Add IoTDBRemoveDataNodeRegionAllocationIT: it kills a DataNode, submits the removal, and while it is in progress forces a fresh Region allocation via a new database, asserting that none of the newly allocated Regions land on the DataNode being removed and that the removal completes.

codecov · 2026-06-12T13:12:54Z

Codecov Report

❌ Patch coverage is 0% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.06%. Comparing base (abb9ef9) to head (9fb2e92).
⚠️ Report is 3 commits behind head on master.

Files with missing lines	Patch %	Lines
...nfignode/manager/load/balancer/RegionBalancer.java	0.00%	8 Missing ⚠️
...che/iotdb/confignode/manager/ProcedureManager.java	0.00%	6 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #17934      +/-   ##
============================================
- Coverage     41.07%   41.06%   -0.01%     
  Complexity      318      318              
============================================
  Files          5257     5257              
  Lines        365010   365023      +13     
  Branches      47180    47180              
============================================
- Hits         149918   149910       -8     
- Misses       215092   215113      +21

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Caideyipi

Reviewed the change and did not find any blocking issue. The fix correctly excludes DataNodes covered by an unfinished RemoveDataNodesProcedure from new Region allocation while preserving the existing Unknown-node candidate behavior.

Copilot

Pull request overview

This PR prevents ConfigNode from allocating new Region replicas onto DataNodes that are currently being removed (even if they appear as Unknown due to heartbeat loss), avoiding stranded replicas that can stall RemoveDataNodesProcedure indefinitely.

Changes:

Add ProcedureManager.getRemovingDataNodeIds() to derive the authoritative “removing DataNode” set from in-progress RemoveDataNodesProcedure instances.
Update RegionBalancer.genRegionGroupsAllocationPlan to filter out those removing DataNodes from region-allocation candidates while still allowing Unknown nodes in general.
Add an integration test that kills a DataNode, starts removal, forces new Region allocation, and asserts newly allocated Regions do not land on the removing DataNode and that removal completes.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/ProcedureManager.java	Exposes removing-DataNode IDs by scanning in-progress `RemoveDataNodesProcedure`s.
iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/load/balancer/RegionBalancer.java	Excludes removing DataNodes from allocation candidates to prevent allocating new replicas onto nodes being removed.
integration-test/src/test/java/org/apache/iotdb/confignode/it/removedatanode/IoTDBRemoveDataNodeRegionAllocationIT.java	Adds regression coverage for “new Region allocation during remove datanode” scenario.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The test re-opened a multi-node connection after kill -9'ing the target DataNode. EnvFactory.getConnection() fans every read out to all DataNodes including the dead one, so each executeQuery returned [Connection Error, null, null, null] and failed with InconsistentDataException at the connection site rather than exercising the bug. Pin both the post-kill connection and the allocation-probe connection to a surviving DataNode via a new BaseEnv.getConnection(DataNodeWrapper) overload.

CRZbulabula · 2026-06-15T03:52:30Z

Fixed the CI failure in `IoTDBRemoveDataNodeRegionAllocationIT` (e75e196).

Root cause: after `kill -9`-ing the target DataNode, the test re-opened a multi-node `EnvFactory.getConnection()`. That connection fans every read out to all DataNodes, including the dead one, so each `executeQuery` (`SHOW DATANODES`, region maps) returned `[Connection Error, null, null, null]` and failed with `InconsistentDataException` at the connection site — before the test could exercise the allocation behaviour.

Fix: pin all post-kill SQL (both the main statement and the allocation-probe connection) to a surviving DataNode via a new `BaseEnv.getConnection(DataNodeWrapper)` overload, which scopes write and read to a single live node. This matches the existing framework intent ("This is useful when you shut down a dataNode").

The remaining CI reds (`IoTDBConnectionsIT` on Windows, `IoTDBPipeClusterIT` on dual-table-manual-enhanced) are unrelated pre-existing flakes — they touch the session and pipe subsystems, not region allocation.

With Ratis schema consensus and schemaReplicationFactor=2, kill -9'ing one of the two schema-replica holders breaks Ratis quorum: the schema-region migration off the dead node fails (DO_ADD_REGION_PEER times out), so RemoveDataNodesProcedure rolls back and the node never leaves the cluster, making awaitUntilSuccess time out after 5 minutes. Raise schemaReplicationFactor to 3 (matching the proven IoTDBRemoveUnknownDataNodeIT#successTest config) so one kill still leaves a quorum (2 of 3) and the removal can actually complete. With 4 DataNodes, removing 1 leaves exactly MINIMUM_DATANODE = max(schemaRF, dataRF) = 3, so the removal is still permitted and the allocation candidate pool (the 3 survivors) is large enough for the forced new RF=3 schema / RF=2 data regions.

CRZbulabula · 2026-06-15T06:11:28Z

Second CI fix (9fb2e92).

After the connection fix, the IT got past the connection step and the core assertion (`assertNewRegionsExcludeDataNode` — new Regions exclude the removing node) passed, but it then timed out at `awaitUntilSuccess`: the removal never completed.

Root cause (from the ConfigNode cluster logs): schema Regions use Ratis, and with `schemaReplicationFactor=2`, `kill -9`-ing one of the two replica holders breaks Ratis quorum. The schema-Region migration off the dead node (`AddRegionPeerProcedure.DO_ADD_REGION_PEER`) times out because the old 2-peer config (one dead) can't reach majority, so `RemoveDataNodesProcedure` rolls back (`migratedFailedRegions:[SchemaRegion id:0]`) and the node stays in the cluster. This is correct product behaviour — it was a test-config mistake.

Fix: raise `schemaReplicationFactor` to 3 (matching the proven `IoTDBRemoveUnknownDataNodeIT#successTest` config). One kill then leaves a 2-of-3 quorum so the schema Region can migrate and the removal completes. With 4 DataNodes, removing 1 leaves exactly `MINIMUM_DATANODE = max(schemaRF, dataRF) = 3`, so removal is still permitted and the candidate pool (3 survivors) is large enough for the forced new RF=3 schema / RF=2 data Regions.

sonarqubecloud · 2026-06-15T06:17:39Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

CRZbulabula requested a review from Copilot June 15, 2026 02:25

Copilot started reviewing on behalf of CRZbulabula June 15, 2026 02:26 View session

Caideyipi approved these changes Jun 15, 2026

View reviewed changes

Copilot AI reviewed Jun 15, 2026

View reviewed changes

CRZbulabula added 2 commits June 15, 2026 10:37

Reflow awkwardly line-broken comments in region-allocation IT

77633be

CRZbulabula merged commit 08c55b9 into master Jun 15, 2026
43 of 44 checks passed

CRZbulabula deleted the fix_v2_990_alloc_excludes_removing_dn branch June 15, 2026 09:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exclude DataNodes being removed from new Region allocation#17934

Exclude DataNodes being removed from new Region allocation#17934
CRZbulabula merged 4 commits into
masterfrom
fix_v2_990_alloc_excludes_removing_dn

CRZbulabula commented Jun 12, 2026

Uh oh!

codecov Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

Caideyipi left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

CRZbulabula commented Jun 15, 2026

Uh oh!

CRZbulabula commented Jun 15, 2026

Uh oh!

sonarqubecloud Bot commented Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

CRZbulabula commented Jun 12, 2026

Problem

Root cause

Fix

Test

Uh oh!

codecov Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Caideyipi left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

CRZbulabula commented Jun 15, 2026

Uh oh!

CRZbulabula commented Jun 15, 2026

Uh oh!

sonarqubecloud Bot commented Jun 15, 2026

Quality Gate passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented Jun 12, 2026 •

edited

Loading